Random Forests and Adaptive Nearest Neighbors
Authors
Abstract
In this paper we study random forests through their connection with a new framework of adaptive nearest neighbor methods. We first introduce the concept of potential nearest neighbors (k-PNNs) and show that random forests can be viewed as adaptively weighted k-PNN methods. Various aspects of random forests are then studied from this perspective. We investigate the effect of terminal node sizes and splitting schemes on the performance of random forests. It has been commonly believed that random forests work best when grown with the largest trees possible. We derive a lower bound on the rate of the mean squared error of regression random forests with non-adaptive splitting schemes and show that, asymptotically, growing the largest trees in such random forests is not optimal. However, a very large sample size may be needed before this asymptotic result takes effect in high-dimensional problems. We illustrate with simulations the effect of terminal node sizes on the prediction accuracy of random forests with other splitting schemes. In general, it is advantageous to tune the terminal node size for the best performance of random forests. We further show that random forests with adaptive splitting schemes assign weights to k-PNNs in a desirable way: for estimation at a given target point, these random forests assign voting weights to the k-PNNs of the target point according to the local importance of different input variables. We propose a new, simple splitting scheme that achieves this adaptivity in a straightforward fashion and can be combined with existing algorithms. The resulting algorithm is computationally faster and gives comparable results. Other aspects of random forests, such as using linear combinations of variables in splitting, are also discussed. Simulations and real datasets are used to illustrate the results.
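The abstract's point about tuning terminal node size can be illustrated with a minimal sketch. The code below (an assumption for illustration, not the authors' experiments) fits scikit-learn regression forests with several values of `min_samples_leaf`, which controls the terminal node size, on synthetic data where only two of ten inputs are informative, and reports the test mean squared error for each setting.

```python
# Illustrative sketch: effect of terminal node size on a regression
# random forest, using scikit-learn on synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.uniform(size=(n, d))
# Only the first two inputs carry signal; the remaining eight are noise.
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

results = {}
for leaf in (1, 5, 20):  # candidate terminal node sizes
    rf = RandomForestRegressor(
        n_estimators=200, min_samples_leaf=leaf, random_state=0
    ).fit(X_train, y_train)
    mse = float(np.mean((rf.predict(X_test) - y_test) ** 2))
    results[leaf] = mse
    print(f"min_samples_leaf={leaf:2d}  test MSE={mse:.4f}")
```

Which terminal node size wins depends on the problem and sample size, which is exactly why the abstract recommends tuning it rather than defaulting to fully grown trees.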
Similar Articles
The performance of small samples in quantifying the structure of central Zagros forests using nearest-neighbor-based indices
Abstract Forest structure has become one of the main ecological debates in forest science today. Determining forest structure characteristics is necessary for investigating stand-change processes and for planning silvicultural interventions and restoration operations. In order to investigate the structure of part of the Ghale-Gol forests in Khorramabad, a set of indices such as Cla...
Adaptively Discovering Meaningful Patterns in High-Dimensional Nearest Neighbor Search
To query high-dimensional databases, similarity search (or k-nearest-neighbor search) is the most extensively used method. However, since each attribute of a high-dimensional data record contains only a very small amount of information, the distance between two high-dimensional records may not always correctly reflect their similarity. So, a multi-dimensional query may have a k-nearest-neighbor set whi...
Estimation of Density using Plotless Density Estimator Criteria in Arasbaran Forest
Sampling methods have a theoretical basis and should be operational in different forests; therefore, selecting an appropriate sampling method is essential for accurate estimation of forest characteristics. The purpose of this study was to estimate the stand density (number per hectare) in the Arasbaran forest using a variety of plotless density estimators from the nearest-neighbor sampling me...
An adaptive classification method for multimedia retrieval
Relevance feedback can effectively improve the performance of content-based multimedia retrieval systems. To be effective, a relevance feedback approach must be able to efficiently capture the user’s query concept from a very limited number of training samples. To address this issue, we propose a novel adaptive classification method using random forests, which is a machine learning algorithm wi...
Imputation of Missing Values for Unsupervised Data Using the Proximity in Random Forests
This paper presents a new procedure that imputes missing values by random forests for unsupervised data. We found that it performs well compared with k-nearest-neighbor (kNN) imputation and with a rough imputation that replaces missing values with variable medians. Moreover, this procedure can be extended to semi-supervised data sets. The rate of correct classification is higher than that of other conventional method...